Conversation

suyoggupta commented on Jul 27, 2025

@coderabbitai summary

pytorch:

                                                                                                                                                
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     71.5680
Total Output Throughput (tokens/sec):             9160.7029
Total Token Throughput (tokens/sec):              18321.4059
Total Latency (ms):                               3577.0181
Average request latency (ms):                     3504.4893
Per User Output Throughput [w/ ctx] (tps/user):   36.5284
Per GPU Output Throughput (tps/gpu):              9160.7029

AutoDeploy:

                                                                                                                                                
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     67.6845
Total Output Throughput (tokens/sec):             8663.6133
Total Token Throughput (tokens/sec):              17327.2265
Total Latency (ms):                               3782.2556
Average request latency (ms):                     3704.9923
Per User Output Throughput [w/ ctx] (tps/user):   34.5515
Per GPU Output Throughput (tps/gpu):              8663.6133

Net of the comparison: AutoDeploy is within about 5-6% of the pytorch backend on this run (~5.4% lower throughput, ~5.7% higher total latency).

Copilot AI review requested due to automatic review settings July 27, 2025 08:02

Copilot AI left a comment

Pull Request Overview

This PR cuts host-device round trips when processing new tokens in the auto deploy executor. Instead of copying new tokens from GPU to CPU and back, the changes keep the new tokens on the GPU throughout the processing pipeline.

  • Replace CPU conversion of new_tokens with direct GPU tensor handling
  • Update the sequence information interface to accept GPU tensors directly
  • Add NVTX profiling ranges for performance monitoring
  • Optimize tensor creation with pinned memory for faster transfers (see the sketch after this list)
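A minimal sketch of these patterns, not the PR's actual code: scatter sampled tokens into a device-resident buffer, use pinned memory where a host staging copy is unavoidable, and bracket the hot path with an NVTX range. Shapes and names (new_tokens as [num_beams, batch_size, 1], a flat input_ids buffer with -1 marking empty slots) are assumptions inferred from the review comments below; it requires a CUDA device.

```python
import torch

device = torch.device("cuda")
batch_size, max_num_tokens = 4, 16

# Assumed shape: [num_beams, batch_size, 1], produced by the sampler on GPU.
new_tokens = torch.randint(0, 32000, (1, batch_size, 1), device=device)
input_ids = torch.ones(max_num_tokens, dtype=torch.int, device=device)
input_ids[:batch_size] = -1  # slots awaiting newly sampled tokens

# Before: new_tokens.cpu() round-tripped through host memory every step,
# forcing a device sync. After: scatter directly on the GPU instead.
previous_batch_indices = torch.arange(batch_size, device=device)
with torch.cuda.nvtx.range("update_input_ids"):  # shows up in Nsight Systems
    input_ids[input_ids == -1] = new_tokens[0, previous_batch_indices, 0].to(input_ids.dtype)

# Where a host buffer is still needed, pinned memory enables an async
# host-to-device copy that can overlap with compute.
host_buf = torch.empty(batch_size, dtype=torch.int, pin_memory=True)
dev_buf = host_buf.to(device, non_blocking=True)
```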

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py | Removes CPU conversion of new_tokens and passes GPU tensors directly to the sequence interface |
| tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py | Updates the sequence interface to handle GPU tensors and adds performance optimizations |

max_num_tokens=max_num_tokens,
)

print(" in seq_info for device: ", torch.cuda.current_device())

Copilot AI Jul 27, 2025

Debug print statement should be removed before merging to production. This appears to be leftover debugging code.

Suggested change
- print(" in seq_info for device: ", torch.cuda.current_device())
+ ad_logger.info(f"in seq_info for device: {torch.cuda.current_device()}")

Comment on lines +420 to +421
self.input_ids[self.input_ids == -1] = new_tokens[0,previous_batch_indices,0]


Copilot AI Jul 27, 2025

This indexing assumes new_tokens has at least 3 dimensions and previous_batch_indices is valid, but there's no bounds checking. If previous_batch_indices contains invalid indices or new_tokens has different dimensions, this will cause a runtime error.

Suggested change
- self.input_ids[self.input_ids == -1] = new_tokens[0,previous_batch_indices,0]
+ # Validate new_tokens dimensions
+ if new_tokens.dim() < 3:
+     raise ValueError(f"new_tokens must have at least 3 dimensions, but got {new_tokens.dim()}.")
+ # Validate previous_batch_indices
+ max_index = new_tokens.size(1) - 1
+ if any(idx < 0 or idx > max_index for idx in previous_batch_indices):
+     raise IndexError(f"previous_batch_indices contains out-of-bounds indices for new_tokens' second dimension (valid range: 0 to {max_index}).")
+ self.input_ids[self.input_ids == -1] = new_tokens[0, previous_batch_indices, 0]
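One caveat on the suggestion above: if previous_batch_indices is a GPU tensor rather than a Python list, the any(...) generator pulls elements across the device boundary one at a time. A torch-native sketch (assuming a torch.Tensor input; the actual type in this code path isn't shown in the diff) keeps the reduction on device:

```python
import torch

def validate_batch_indices(previous_batch_indices: torch.Tensor, max_index: int) -> None:
    """Bounds-check indices with device-side reductions (hypothetical helper)."""
    if previous_batch_indices.numel() == 0:
        return
    # min()/max() run on the device; int(...) costs at most two host syncs,
    # versus one sync per element with any(... for idx in tensor).
    lo, hi = int(previous_batch_indices.min()), int(previous_batch_indices.max())
    if lo < 0 or hi > max_index:
        raise IndexError(
            f"previous_batch_indices out of bounds (min={lo}, max={hi}, "
            f"valid range: 0 to {max_index})."
        )
```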

Comment on lines +125 to 128
self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)
self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
self.input_pos = torch.empty_like(self.seq_len)

Copilot AI Jul 27, 2025

The tensor is created with device=self.device, but self.device is a property that depends on self.input_pos.device. At this point in __post_init__, input_pos hasn't been initialized yet, which could cause an AttributeError.

Suggested change
- self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
- self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)
- self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
- self.input_pos = torch.empty_like(self.seq_len)
+ self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
+ self.input_pos = torch.empty_like(self.seq_len)
+ self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
+ self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)
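For context on why the reordering matters, here is a plausible shape of the device property (an assumption; the actual definition in attention_interface.py isn't shown in this diff). It derives the device from input_pos, so any tensor built with device=self.device must be created after input_pos exists:

```python
import torch
from dataclasses import dataclass

@dataclass
class SequenceInfo:  # hypothetical, trimmed to the fields under discussion
    max_batch_size: int = 8
    max_num_tokens: int = 64

    def __post_init__(self) -> None:
        # seq_len/input_pos first: self.device reads input_pos.device below.
        self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
        self.input_pos = torch.empty_like(self.seq_len)
        self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
        self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)

    @property
    def device(self) -> torch.device:
        # Assumed definition, per the review comment above.
        return self.input_pos.device

info = SequenceInfo()
assert info.input_ids.device == info.device
```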
